Assignment instructions:

Assignment submission (YOUR NAME): Jordan Sibley


library(tidyverse)
library(here)
library(janitor)
library(estimatr)  
library(performance)
library(jtools)
library(gt)
library(gtsummary)
library(MASS) ## NOTE: The `select()` function is masked. Use: `dplyr::select()` ##
library(interactions) 

DATA SOURCE:

Reed D. 2019. SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus), ongoing since 2012. Environmental Data Initiative. https://doi.org/10.6073/pasta/a593a675d644fdefb736750b291579a0. Dataset accessed 11/17/2019.


Introduction

You’re about to dive into some deep data collected from five reef sites in Santa Barbara County, all about the abundance of California spiny lobsters! 🦞 Data was gathered by divers annually from 2012 to 2018 across Naples, Mohawk, Isla Vista, Carpinteria, and Arroyo Quemado reefs.

Why lobsters? Well, this sample provides an opportunity to evaluate the impact of Marine Protected Areas (MPAs) established on January 1, 2012 (Reed, 2019). Of these five reefs, Naples, and Isla Vista are MPAs, while the other three are not protected (non-MPAs). Comparing lobster health between these protected and non-protected areas gives us the chance to study how commercial and recreational fishing might impact these ecosystems.

We will consider the MPA sites the treatment group and use regression methods to explore whether protecting these reefs really makes a difference compared to non-MPA sites (our control group). In this assignment, we’ll think deeply about which causal inference assumptions hold up under the research design and identify where they fall short.

Let’s break it down step by step and see what the data reveals! 📊


Step 1: Anticipating potential sources of selection bias

a. Do the control sites (Arroyo Quemado, Carpenteria, and Mohawk) provide a strong counterfactual for our treatment sites (Naples, Isla Vista)? Write a paragraph making a case for why this comparison is centris paribus or whether selection bias is likely (be specific!).

Revised answer: The control sites, (Arroyo Quemado, Carpenteria, and Mohawk) do not likely provide a strong counterfactual for our treatment sites (Naples, Isla Vista). A strong counterfactual in an experimental design usually has a random assignment, resembles the treatment site as similarly as possible, and can control for confounding variables. A marine site does not become a designated protected area randomly, and there is a lot of bureaucracy and different motivations for choosing certain sites. Additionally, even though these sites are close together on the map, reefs can vary in their composition and environmental variables despite their proximity. Because of these reasons, the control sites are not likely strong counterfactuals, however, in natural experiments sometimes you have to work with what you have.


Step 2: Read & wrangle data

a. Read in the raw data. Name the data.frame (df) rawdata

b. Use the function clean_names() from the janitor package

# HINT: check for coding of missing values (`na = "-99999"`)

# Read in spiny lobs data,, clean column names, and convert NA values  
rawdata <- read_csv(here::here('data', 'spiny_abundance_sb_18.csv')) |> 
    janitor::clean_names() |> 
    mutate(size_mm = na_if(size_mm, -99999))

c. Create a new df named tidyata. Using the variable site (reef location) create a new variable reef as a factor and add the following labels in the order listed (i.e., re-order the levels):

"Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples"
# Create new df with factored sites 
tidydata <- rawdata |> 
    mutate(reef = factor(site, 
                         levels = c("AQUE", "CARP", "MOHK", "IVEE",  "NAPL"),
                         labels = c("Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples")))

Create new df named spiny_counts

d. Create a new variable counts to allow for an analysis of lobster counts where the unit-level of observation is the total number of observed lobsters per site, year and transect.

e. Create a new variable mpa with levels MPA and non_MPA. For our regression analysis create a numerical variable treat where MPA sites are coded 1 and non_MPA sites are coded 0

#HINT(d): Use `group_by()` & `summarize()` to provide the total number of lobsters observed at each site-year-transect row-observation. 

#HINT(e): Use `case_when()` to create the 3 new variable columns

spiny_counts <- tidydata |>
    # group by to get total count of lobs per transect 
    group_by(year, site, transect) |> 
    summarise(counts = sum(count),
              mean_size = mean(size_mm, na.rm = TRUE)) |> 
    # create new column that gives protection status 
    mutate(mpa = case_when(site %in% c('AQUE', 'CARP', 'MOHK') ~ 'non-MPA',
    TRUE ~ 'MPA')) |> 
    # create column where mpa = 1 and non-mpas = 0 
    mutate(treat = case_when(mpa == 'non-MPA' ~ 0, 
                             mpa == 'MPA' ~ 1)) |> 
    ungroup()

NOTE: This step is crucial to the analysis. Check with a friend or come to TA/instructor office hours to make sure the counts are coded correctly!


Step 3: Explore & visualize data

a. Take a look at the data! Get familiar with the data in each df format (tidydata, spiny_counts)

b. We will focus on the variables count, year, site, and treat(mpa) to model lobster abundance. Create the following 4 plots using a different method each time from the 6 options provided. Add a layer (geom) to each of the plots including informative descriptive statistics (you choose; e.g., mean, median, SD, quartiles, range). Make sure each plot dimension is clearly labeled (e.g., axes, groups).

Create plots displaying the distribution of lobster counts:

  1. grouped by reef site
  2. grouped by MPA status
  3. grouped by year

Create a plot of lobster size :

  1. You choose the grouping variable(s)!
# plot 1: ridge plot of lobs counts grouped by reef site (with mean counts)

# Calculate mean counts 
mean_counts <- spiny_counts |> 
    group_by(site) |> 
    summarize(mean_count = mean(counts)) |> 
    ungroup() # un group variables 
    

# Plot 
spiny_counts %>% 
ggplot(aes(x = counts, y = site, fill = site)) + 
    ggridges::geom_density_ridges() + 
    # add text of mean counts 
    geom_text(data = mean_counts, 
              aes(x = max(spiny_counts$counts) + 2,
                  y = (site),
              label = paste0("Mean counts: ", round(mean_count, 1))),
              hjust = 1,
              vjust = -.8,
              size = 3) +
    theme_light() + 
    # remove legend 
    theme(legend.position = "none") + 
    labs(x = "Lobster counts per transect",
         y = "LTER Site",
         title = "Density of Spiny Lobsters Counts Per LTER Site")

# plot 2: histogram lobs counts grouped by MPA status (with median counts)

# Calculate medians for each mpa group
medians <- spiny_counts %>%
  group_by(mpa) %>%
  summarize(median_count = median(counts))

# Plot 
spiny_counts |> 
    ggplot(aes(x = counts)) +
    geom_histogram(fill = 'lightblue',
                   col = 'darkblue') + 
    facet_wrap(~mpa) + 
    # lines that show median 
    geom_vline(data = medians, aes(xintercept = median_count), 
               color = 'darkred', linetype = "dashed", size = 1) + 
    # median label 
    geom_text(data = medians, aes(x = median_count, y = -Inf, 
                                  label = paste0("Median: ", median_count)),
              hjust = -0.3, vjust = -8, color = "darkred", size = 4) + 
    # labels 
    labs(x = 'Lobster counts per transect',
         y = 'Frequency',
         title = 'Histogram of Lobster Counts per LTER sites (MPA vs Non-MPA)') + 
    theme_light()

# plot 3: jitter plot lobs counts grouped by year (with max count)

# Identify max counts for each year
counts_max <- spiny_counts %>%
  group_by(year) %>%
  mutate(is_max = counts == max(counts)) %>%
  ungroup()

# Plot 
counts_max |> 
    ggplot(aes(x = year, y = counts, col = year)) + 
    geom_jitter() + 
    # set x axis 
    scale_x_continuous(breaks = seq(2012, 2018, by = 1)) + 
    # highlight the max count of each day 
    gghighlight::gghighlight(is_max, label_key = counts)  +
    # labels 
    labs(x = "Year",
         y = "Lobster Counts",
         title = "Maximum Lobster Counts per Year") + 
    theme_minimal()

# plot 4: Violin plot of lobs size_mm grouped by mpa status 
spiny_counts |> 
  ggplot(aes(x = mpa, y = mean_size, fill = mpa)) +
  geom_violin() + 
  geom_boxplot(width = 0.1, color = "black", alpha = 0.5) + 
  scale_fill_manual(values = c("non-MPA" = "#FF9999", "MPA" = "#99CCFF")) + 
  theme_minimal() + 
  theme(legend.position = "none") + # Remove the legend
  labs(x = "MPA Status",
       y = "Mean Size (mm)",
       title = "Comparison of Mean Lobster Size in MPA and Non-MPA Areas"
  )

c. Compare means of the outcome by treatment group. Using the tbl_summary() function from the package gt_summary

# USE: gt_summary::tbl_summary()

# 0 = Non-MPA , 1 = MPA 
spiny_counts |> 
    gtsummary::tbl_summary(
        by = treat, 
        statistic = list(all_continuous() ~ "{mean} ({sd})") # mean and SD
    )
Characteristic 0
N = 133
1
1
N = 119
1
year

    2012 19 (14%) 17 (14%)
    2013 19 (14%) 17 (14%)
    2014 19 (14%) 17 (14%)
    2015 19 (14%) 17 (14%)
    2016 19 (14%) 17 (14%)
    2017 19 (14%) 17 (14%)
    2018 19 (14%) 17 (14%)
site

    AQUE 49 (37%) 0 (0%)
    CARP 63 (47%) 0 (0%)
    IVEE 0 (0%) 56 (47%)
    MOHK 21 (16%) 0 (0%)
    NAPL 0 (0%) 63 (53%)
transect

    1 21 (16%) 14 (12%)
    2 21 (16%) 14 (12%)
    3 21 (16%) 14 (12%)
    4 14 (11%) 14 (12%)
    5 14 (11%) 14 (12%)
    6 14 (11%) 14 (12%)
    7 14 (11%) 14 (12%)
    8 7 (5.3%) 14 (12%)
    9 7 (5.3%) 7 (5.9%)
counts 23 (39) 28 (44)
mean_size 73 (7) 76 (7)
    Unknown 15 12
mpa

    MPA 0 (0%) 119 (100%)
    non-MPA 133 (100%) 0 (0%)
1 n (%); Mean (SD)

Step 4: OLS regression- building intuition

a. Start with a simple OLS estimator of lobster counts regressed on treatment. Use the function summ() from the jtools package to print the OLS output

b. Interpret the intercept & predictor coefficients in your own words. Use full sentences and write your interpretation of the regression results to be as clear as possible to a non-academic audience.

# NOTE: We will not evaluate/interpret model fit in this assignment (e.g., R-square)

m1_ols <- lm(counts ~ treat, data = spiny_counts)
 
summ(m1_ols, model.fit = FALSE) 
Observations 252
Dependent variable counts
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 22.73 3.57 6.36 0.00
treat 5.36 5.20 1.03 0.30
Standard errors: OLS

The results of the OLS linear regression model of lobster counts regressed on treatment gave an intercept of 22.73. The intercept means that when treatment is zero (or the site is not an MPA), the average lobster counts per transect is 22.73. The predictor coefficient is 5.36. This means that when treatment is equal to 1 (the site is an MPA), the average lobsters counted is 5.36 more than when treatment is zero, which is ~ 28 lobsters per transect.

c. Check the model assumptions using the check_model function from the performance package

d. Explain the results of the 4 diagnostic plots. Why are we getting this result?

check_model(m1_ols,  check = "qq" )

check_model(m1_ols, check = "normality")

check_model(m1_ols, check = "homogeneity")

check_model(m1_ols, check = "pp_check")

The above plots help to determine if the assumptions of OLS are met by this model. The first one is a QQ plot, which measures how similar the distribution of the data is to the therotical distribution, which can be used to assess the normality of the residuals (an assumption of OLS). Ideally, the plots would fall along the straight vertical line, but the plot above shows that they do not, meaning that the residuals of the data are not normal. The next plot also looks at the normality of the residuals. If the residuals where normal, they would resemble the normal line, but we can see that the density of the residuals is very heavily centered near below zero, and then there is a sharp decrease and it tapers out, which doesn’t exactly fit the normal line.

The third plot measures the homogeneity of variance. The OLS assumption is that the variance of the errors is consistent for all observations. In the plot of the fitted values versus the residuals, the line should be flat and horizontal, but the plot above shows a parabola shape, meaning the variance is not homogeneous. The last plot is a posterior predictive check, which basically just shows the model predicted values versus the true values. If the model is a good fit, the lines show resemble each other. However, this plot shows that the predicted values are not steep enough around the zero counts mark to fit the data.


Step 5: Fitting GLMs

a. Estimate a Poisson regression model using the glm() function

b. Interpret the predictor coefficient in your own words. Use full sentences and write your interpretation of the results to be as clear as possible to a non-academic audience.

c. Explain the statistical concept of dispersion and overdispersion in the context of this model.

d. Compare results with previous model, explain change in the significance of the treatment effect

#HINT1: Incidence Ratio Rate (IRR): Exponentiation of beta returns coefficient which is interpreted as the 'percent change' for a one unit increase in the predictor 

#HINT2: For the second glm() argument `family` use the following specification option `family = poisson(link = "log")`

# Poisson regression model 
m2_pois <- glm(counts ~ treat, data = spiny_counts,  family= poisson(link = 'log'))

# Summary of model 
summ(m2_pois, model.fit = FALSE) # pred var = 0.21 
Observations 252
Dependent variable counts
Type Generalized linear model
Family poisson
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.02 171.74 0.00
treat 0.21 0.03 8.44 0.00
Standard errors: MLE
# Percent change 
exp(0.21) -1
## [1] 0.2336781

This model is of lobster counts regressed on treatment, but as a Poisson regression model. Poisson differs from a OLS model in that Poissons are designed to analyze count data while OLS usually works better for continuous dependent variable. The results of the model give an Intercept of 22.73, and the predictor intercept is 1.24. This means that the average lobster counts for non-MPA sites is 22.73 and the average for the MPA sites is 23.96 (1.24 more). The calculation of the incident ratio rate shows a 23% increase in lobster counts between the non-MPA sites and MPA sites.

In a Poisson models, you need to check for overdispersion. Dispersion refers to the relationship between the mean and the variance. An assumption of the Poisson model is that the mean is equal to the dispersion (or variance). Overdispersion means that variance in the data is greater than the mean. If this model is not overdispersed, than the mean of the counts should equal the variance. Overdispersion is going to be checked down below.

e. Check the model assumptions. Explain results.

f. Conduct tests for over-dispersion & zero-inflation. Explain results.

check_model(m2_pois)

check_overdispersion(m2_pois)
## # Overdispersion test
## 
##        dispersion ratio =    67.033
##   Pearson's Chi-Squared = 16758.289
##                 p-value =   < 0.001
check_zeroinflation(m2_pois)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 0
##             Ratio: 0.00

The code check_model() provides a visual check of the model assumptions like checking normality of residual, normality of random effects, linear relationship, variance, and multicollinearity. The assumptions of a Poisson model are that Y follows a Poisson distribution with non-negative integers and that variance (dispersion) is proportional to the mean.

The posterior predictive check that is used to determine discrepancies between real and predicted data shows that when lobster count is low, the actual and model predicted values are not a great fit, but once count increased the model and actual points match up really well. The homogeneity of variance plot reveals that the points are not randomly spread above and below the reference line, and instead are found only clustered at two values of fitted values. This indicates that the variance is not homoscedastic. Another plot that stood out in this check is the uniformity of residuals. Ideally, the points should fall along the diagonal lone, but we can see that the dots follow more of logistic curve.

The overdispersion test indicates that overdispersion is detected in this model. The results give a dispersion ratio of 67.03, when you are usually looking to have this ratio be close to 1 if the model is a good fit. Additionally, the Pearson’s Chi squared test gives a very large value (16,758), which indicates a poor fit.

The results of the zero inflation test also reveal that this model is a poor fit. This checks if the amount of observed zeros is larger than the amount of predicted zeros. The results show that the predicted zeros should be 0, but the observed zeros is 27. This means that the model is under fitting zeros, and is therefor a bad fit.

g. Fit a negative binomial model using the function glm.nb() from the package MASS and check model diagnostics

h. In 1-2 sentences explain rationale for fitting this GLM model.

i. Interpret the treatment estimate result in your own words. Compare with results from the previous model.

# NOTE: The `glm.nb()` function does not require a `family` argument

# Negative binomial model 
m3_nb <- glm.nb(counts ~ treat, data = spiny_counts)

# Summary of model 
summ(m3_nb, model.fit = FALSE)
Observations 252
Dependent variable counts
Type Generalized linear model
Family Negative Binomial(0.55)
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.12 26.40 0.00
treat 0.21 0.17 1.23 0.22
Standard errors: MLE

A negative binomial model is used to analyze count data where the variance is greater than the mean. Since the last model indicated that overdispersion was detected in the data. This model might be a good fit.

The results of this model are similar to that of Poisson model in that the results of the intercept and predictor intercept are both 22.73 and 0.21 respectively. Now we need to see if the assumptions of this model are met.

TALK ABOUT P VALUES COMPARE MODELS

check_overdispersion(m3_nb)
## # Overdispersion test
## 
##  dispersion ratio = 1.400
##           p-value = 0.064
check_zeroinflation(m3_nb)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 30
##             Ratio: 1.12
check_predictions(m3_nb)

check_model(m3_nb)

The results of the model assumption checks indicate that this model seems to be a better for the data. The overdispersion check shows no overdispersion is not detected (dispersion ratio ~ 1). The zero inflation test showed that the observed zeros is 27 and the predicted zeros is 30 which is not a perfect match, but it is much better than the Poisson model. The posterior predictive check shows that the observed data and the model predicted data are a very close fit, especially when comparing it to the Poisson model. And finally the other plots present in the check_model test show a much closer fit of the observed residuals and the predicted residuals and the uniformity of residuals almost an exact fit. In comparison to the Poisson model the tests are revealing that the negative binomial model is a much better fit for the lobster counts data.


Step 6: Compare models

a. Use the export_summ() function from the jtools package to look at the three regression models you fit side-by-side.

c. Write a short paragraph comparing the results. Is the treatment effect robust or stable across the model specifications.

# Compare 3 regression models 
export_summs(m1_ols, m2_pois, m3_nb, 
              model.names = c("OLS","Poisson", "NB"),
              statistics = "none")
OLSPoissonNB
(Intercept)22.73 ***3.12 ***3.12 ***
(3.57)   (0.02)   (0.12)   
treat5.36    0.21 ***0.21    
(5.20)   (0.03)   (0.17)   
*** p < 0.001; ** p < 0.01; * p < 0.05.

The code above compares the intercepts and predictor intercepts of the three different models of lobster counts regressed against treatment (MPA or non-MPA). The results show that the intercept, or average lobster count when site is a non-MPA is 22.73. The intercept for Poisson and negative binomial (NB) is 3.12, but you must exponentiate it to get the predicted count which is ~ 22.6. The effect of treatment (MPA site) for the OLS model is an average increase of 5.36 lobster counts while for the Poisson and NB models predictor intercepts are 0.21 (which is the log of the IRR) so approximately a 23% increase in predicted lobster count. The results of this test do demonstrate that the intercept is statistically significant for all three models, but for the predictor intercept only the Poisson value is statistically significant. The results indicate that the treatment effect is not robust across the models since the statistical significance varies.


Step 7: Building intuition - fixed effects

a. Create new df with the year variable converted to a factor

b. Run the following negative binomial model using glm.nb()

c. Take a look at the regression output. Each coefficient provides a comparison or the difference in means for a specific sub-group in the data. Informally, describe the what the model has estimated at a conceptual level (NOTE: you do not have to interpret coefficients individually)

d. Explain why the main effect for treatment is negative? *Does this result make sense?

ff_counts <- spiny_counts %>% 
    mutate(year=as_factor(year))
     
m5_fixedeffs <- glm.nb(
     counts ~ treat + year + treat*year,
     data = ff_counts)
 
summ(m5_fixedeffs, model.fit = FALSE)
Observations 252
Dependent variable counts
Type Generalized linear model
Family Negative Binomial(0.8129)
Link log
Est. S.E. z val. p
(Intercept) 2.35 0.26 8.89 0.00
treat -1.72 0.42 -4.12 0.00
year2013 -0.35 0.38 -0.93 0.35
year2014 0.08 0.37 0.21 0.84
year2015 0.86 0.37 2.32 0.02
year2016 0.90 0.37 2.43 0.01
year2017 1.56 0.37 4.25 0.00
year2018 1.04 0.37 2.81 0.00
treat:year2013 1.52 0.57 2.66 0.01
treat:year2014 2.14 0.56 3.80 0.00
treat:year2015 2.12 0.56 3.79 0.00
treat:year2016 1.40 0.56 2.50 0.01
treat:year2017 1.55 0.56 2.77 0.01
treat:year2018 2.62 0.56 4.69 0.00
Standard errors: MLE

This negative binomial regression model is different than the first one because this one adds a fixed effect for the year (which is factored, which allows the model to compare each year to the reference year 2012) and an interaction term between year and treat.

The results of the model provides a comparison for each sub group. It compares the average of each year with each treatment type. The coefficients that are year—- represent how lobster counts in the non-MPA sites changed from the baseline year 2012. The general trend is that it steadily increases from from 2013 to 2017/18. The coefficients that are treat:year—- describe the treatment effect varies by year. The positive values suggest that the lobster counts of the MPA sites increased compared to non-MPAs in those years. The baseline year 2012, is negative, which means that in that year there were more lobster counts in the non-MPA sites than the MPA sites. The result makes sense as the Naples MPA was established in 2010 and the Campus Point MPA was established in 2012. The impacts of protection has a bit of a lag since the lobster populations likely needed time to recover and benefit from protection.

e. Look at the model predictions: Use the interact_plot() function from package interactions to plot mean predictions by year and treatment status.

f. Re-evaluate your responses (c) and (b) above.

interact_plot(m5_fixedeffs, pred = year, modx = treat,
               outcome.scale = "link") # NOTE: y-axis on log-scale

# HINT: Change `outcome.scale` to "response" to convert y-axis scale to counts

The plot re affirms what I had observed in the model before. In the baseline year, 2012, there are more average lobster counts in the non-MPA sites and then by the 2018 the MPA sites have more counts. One thing this plot does reveal is that from 2013 to 2017, both the MPA and non-MPA sites average counts are increasing at roughly the same rate.

g. Using ggplot() create a plot in same style as the previous interaction plot, but displaying the original scale of the outcome variable (lobster counts). This type of plot is commonly used to show how the treatment effect changes across discrete time points (i.e., panel data).

The plot should have… - year on the x-axis - counts on the y-axis - mpa as the grouping variable

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

plot_counts <- spiny_counts |> 
    group_by(year, mpa) |> 
    summarise(mean_count = mean(counts), .groups = "drop") |> 
    mutate(year=as_factor(year))

plot_counts %>% ggplot(aes(x = year, y = mean_count, group = mpa, color = mpa)) + 
    # line and dots 
    geom_line(size = 1) + 
    geom_point(size = 3) + 
    # colors 
    scale_color_manual(values = c("non-MPA" = 'darkblue',
                                  "MPA" = 'lightblue')) +
    # labels 
    labs(x = "Year", 
         y = "Mean lobster counts per transect", 
         title = "Average Lobster Counts in SBC LTER Sites (2012-2018)",
         color = "Protection Type") + 
    theme_minimal() + 
    theme(legend.position = "bottom")


Step 8: Reconsider causal identification assumptions

  1. Discuss whether you think spillover effects are likely in this research context (see Glossary of terms; https://docs.google.com/document/d/1RIudsVcYhWGpqC-Uftk9UTz3PIq6stVyEpT44EPNgpE/edit?usp=sharing)

  2. Explain why spillover is an issue for the identification of causal effects

  3. How does spillover relate to impact in this research setting?

  4. Discuss the following causal inference assumptions in the context of the MPA treatment effect estimator. Evaluate if each of the assumption are reasonable:

    1. SUTVA: Stable Unit Treatment Value assumption
    2. Excludability assumption

I had talked a bit about this spill over effect at the beginning of this assignment when I first was anticipating types of bias. Spillover effects ar every likely in this research context. The reason why it might serve as in issue for identification of causal effects is that the lobster (and especially their larvae) are not confined to an individual LTER sites. In fact, there is evidence that the spill over effect is benefiting the spiny lobster fishery in southern California. One of my TA’s from undergrad, Jordan Gallagher, would often talk about the research he did on this topic using this same dataset that we are now. https://www.nature.com/articles/s41598-021-82371-5 Spillover interferes with identifying a treatment’s causal inference as it breaks the assumption of no interference, or it goes against the Unit Treatment Value Assumption (SUTVA). In this study, it is impossible to tell from this data if the added benefit of protection in MPA sites didn’t also benefit the non-MPA sites through the spillover effect. It potentially could explain why from 2012 to 2017/8, average lobster counts increased in both MPA and non-MPA sites. Additionally, we can’t tell if this study adheres to the excludability assumpion, which states that the only way the treatment affects the outcome is through the intended causal pathway (the treatment itself) and not by other means. It might have been valuable to measure variables that impact lobster populations like availability of prey and environmental parameters to get a better understanding if MPA treatment is the true causal effect on lobster counts.


EXTRA CREDIT

Use the recent lobster abundance data with observations collected up until 2024 (lobster_sbchannel_24.csv) to run an analysis evaluating the effect of MPA status on lobster counts using the same focal variables.

  1. Create a new script for the analysis on the updated data
  2. Run at least 3 regression models & assess model diagnostics
  3. Compare and contrast results with the analysis from the 2012-2018 data sample (~ 2 paragraphs)